<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />

<meta name="viewport" content="width=device-width, initial-scale=1">

<meta name="author" content="Wei Vivian Li, Jingyi Jessica Li" />

<meta name="date" content="2018-06-08" />

<title>Introduction to scImpute</title>



<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
  margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>



<link href="data:text/css;charset=utf-8,body%20%7B%0Abackground%2Dcolor%3A%20%23fff%3B%0Amargin%3A%201em%20auto%3B%0Amax%2Dwidth%3A%20700px%3B%0Aoverflow%3A%20visible%3B%0Apadding%2Dleft%3A%202em%3B%0Apadding%2Dright%3A%202em%3B%0Afont%2Dfamily%3A%20%22Open%20Sans%22%2C%20%22Helvetica%20Neue%22%2C%20Helvetica%2C%20Arial%2C%20sans%2Dserif%3B%0Afont%2Dsize%3A%2014px%3B%0Aline%2Dheight%3A%201%2E35%3B%0A%7D%0A%23header%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0A%23TOC%20%7B%0Aclear%3A%20both%3B%0Amargin%3A%200%200%2010px%2010px%3B%0Apadding%3A%204px%3B%0Awidth%3A%20400px%3B%0Aborder%3A%201px%20solid%20%23CCCCCC%3B%0Aborder%2Dradius%3A%205px%3B%0Abackground%2Dcolor%3A%20%23f6f6f6%3B%0Afont%2Dsize%3A%2013px%3B%0Aline%2Dheight%3A%201%2E3%3B%0A%7D%0A%23TOC%20%2Etoctitle%20%7B%0Afont%2Dweight%3A%20bold%3B%0Afont%2Dsize%3A%2015px%3B%0Amargin%2Dleft%3A%205px%3B%0A%7D%0A%23TOC%20ul%20%7B%0Apadding%2Dleft%3A%2040px%3B%0Amargin%2Dleft%3A%20%2D1%2E5em%3B%0Amargin%2Dtop%3A%205px%3B%0Amargin%2Dbottom%3A%205px%3B%0A%7D%0A%23TOC%20ul%20ul%20%7B%0Amargin%2Dleft%3A%20%2D2em%3B%0A%7D%0A%23TOC%20li%20%7B%0Aline%2Dheight%3A%2016px%3B%0A%7D%0Atable%20%7B%0Amargin%3A%201em%20auto%3B%0Aborder%2Dwidth%3A%201px%3B%0Aborder%2Dcolor%3A%20%23DDDDDD%3B%0Aborder%2Dstyle%3A%20outset%3B%0Aborder%2Dcollapse%3A%20collapse%3B%0A%7D%0Atable%20th%20%7B%0Aborder%2Dwidth%3A%202px%3B%0Apadding%3A%205px%3B%0Aborder%2Dstyle%3A%20inset%3B%0A%7D%0Atable%20td%20%7B%0Aborder%2Dwidth%3A%201px%3B%0Aborder%2Dstyle%3A%20inset%3B%0Aline%2Dheight%3A%2018px%3B%0Apadding%3A%205px%205px%3B%0A%7D%0Atable%2C%20table%20th%2C%20table%20td%20%7B%0Aborder%2Dleft%2Dstyle%3A%20none%3B%0Aborder%2Dright%2Dstyle%3A%20none%3B%0A%7D%0Atable%20thead%2C%20table%20tr%2Eeven%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0A%7D%0Ap%20%7B%0Amargin%3A%200%2E5em%200%3B%0A%7D%0Ablockquote%20%7B%0Abackground%2Dcolor%3A%20%23f6f6f6%3B%0Apadding%3A%200%2E25em%200%2E75em%3B%0A%7D%0Ahr%20%7B%0Aborder%2Dstyle%3A%20solid%3B%0Aborder%3A%20none%3B%0Aborde
r%2Dtop%3A%201px%20solid%20%23777%3B%0Amargin%3A%2028px%200%3B%0A%7D%0Adl%20%7B%0Amargin%2Dleft%3A%200%3B%0A%7D%0Adl%20dd%20%7B%0Amargin%2Dbottom%3A%2013px%3B%0Amargin%2Dleft%3A%2013px%3B%0A%7D%0Adl%20dt%20%7B%0Afont%2Dweight%3A%20bold%3B%0A%7D%0Aul%20%7B%0Amargin%2Dtop%3A%200%3B%0A%7D%0Aul%20li%20%7B%0Alist%2Dstyle%3A%20circle%20outside%3B%0A%7D%0Aul%20ul%20%7B%0Amargin%2Dbottom%3A%200%3B%0A%7D%0Apre%2C%20code%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0Aborder%2Dradius%3A%203px%3B%0Acolor%3A%20%23333%3B%0Awhite%2Dspace%3A%20pre%2Dwrap%3B%20%0A%7D%0Apre%20%7B%0Aborder%2Dradius%3A%203px%3B%0Amargin%3A%205px%200px%2010px%200px%3B%0Apadding%3A%2010px%3B%0A%7D%0Apre%3Anot%28%5Bclass%5D%29%20%7B%0Abackground%2Dcolor%3A%20%23f7f7f7%3B%0A%7D%0Acode%20%7B%0Afont%2Dfamily%3A%20Consolas%2C%20Monaco%2C%20%27Courier%20New%27%2C%20monospace%3B%0Afont%2Dsize%3A%2085%25%3B%0A%7D%0Ap%20%3E%20code%2C%20li%20%3E%20code%20%7B%0Apadding%3A%202px%200px%3B%0A%7D%0Adiv%2Efigure%20%7B%0Atext%2Dalign%3A%20center%3B%0A%7D%0Aimg%20%7B%0Abackground%2Dcolor%3A%20%23FFFFFF%3B%0Apadding%3A%202px%3B%0Aborder%3A%201px%20solid%20%23DDDDDD%3B%0Aborder%2Dradius%3A%203px%3B%0Aborder%3A%201px%20solid%20%23CCCCCC%3B%0Amargin%3A%200%205px%3B%0A%7D%0Ah1%20%7B%0Amargin%2Dtop%3A%200%3B%0Afont%2Dsize%3A%2035px%3B%0Aline%2Dheight%3A%2040px%3B%0A%7D%0Ah2%20%7B%0Aborder%2Dbottom%3A%204px%20solid%20%23f7f7f7%3B%0Apadding%2Dtop%3A%2010px%3B%0Apadding%2Dbottom%3A%202px%3B%0Afont%2Dsize%3A%20145%25%3B%0A%7D%0Ah3%20%7B%0Aborder%2Dbottom%3A%202px%20solid%20%23f7f7f7%3B%0Apadding%2Dtop%3A%2010px%3B%0Afont%2Dsize%3A%20120%25%3B%0A%7D%0Ah4%20%7B%0Aborder%2Dbottom%3A%201px%20solid%20%23f7f7f7%3B%0Amargin%2Dleft%3A%208px%3B%0Afont%2Dsize%3A%20105%25%3B%0A%7D%0Ah5%2C%20h6%20%7B%0Aborder%2Dbottom%3A%201px%20solid%20%23ccc%3B%0Afont%2Dsize%3A%20105%25%3B%0A%7D%0Aa%20%7B%0Acolor%3A%20%230033dd%3B%0Atext%2Ddecoration%3A%20none%3B%0A%7D%0Aa%3Ahover%20%7B%0Acolor%3A%20%236666ff%3B%20%7D%0Aa%3Avisited%20%7B%0Acolor%3A%20%238000
80%3B%20%7D%0Aa%3Avisited%3Ahover%20%7B%0Acolor%3A%20%23BB00BB%3B%20%7D%0Aa%5Bhref%5E%3D%22http%3A%22%5D%20%7B%0Atext%2Ddecoration%3A%20underline%3B%20%7D%0Aa%5Bhref%5E%3D%22https%3A%22%5D%20%7B%0Atext%2Ddecoration%3A%20underline%3B%20%7D%0A%0Acode%20%3E%20span%2Ekw%20%7B%20color%3A%20%23555%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%0Acode%20%3E%20span%2Edt%20%7B%20color%3A%20%23902000%3B%20%7D%20%0Acode%20%3E%20span%2Edv%20%7B%20color%3A%20%2340a070%3B%20%7D%20%0Acode%20%3E%20span%2Ebn%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Efl%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Ech%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Est%20%7B%20color%3A%20%23d14%3B%20%7D%20%0Acode%20%3E%20span%2Eco%20%7B%20color%3A%20%23888888%3B%20font%2Dstyle%3A%20italic%3B%20%7D%20%0Acode%20%3E%20span%2Eot%20%7B%20color%3A%20%23007020%3B%20%7D%20%0Acode%20%3E%20span%2Eal%20%7B%20color%3A%20%23ff0000%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%0Acode%20%3E%20span%2Efu%20%7B%20color%3A%20%23900%3B%20font%2Dweight%3A%20bold%3B%20%7D%20%20code%20%3E%20span%2Eer%20%7B%20color%3A%20%23a61717%3B%20background%2Dcolor%3A%20%23e3d2d2%3B%20%7D%20%0A" rel="stylesheet" type="text/css" />

</head>

<body>




<h1 class="title toc-ignore">Introduction to scImpute</h1>
<h4 class="author"><em>Wei Vivian Li, Jingyi Jessica Li</em></h4>
<h4 class="date"><em>2018-06-08</em></h4>



<p>Emerging single-cell RNA sequencing (scRNA-seq) technologies enable the investigation of the transcriptomic landscape at single-cell resolution. However, scRNA-seq analysis is complicated by an excess of zero or near-zero counts in the data, the so-called dropouts, which arise from the low amounts of mRNA within each individual cell. Consequently, downstream analysis of scRNA-seq data would be severely biased if the dropout events are not properly corrected. <code>scImpute</code> is developed to accurately and efficiently impute the dropout values in scRNA-seq data.</p>
<p><code>scImpute</code> can be applied to raw count data before users perform downstream analyses such as</p>
<ul>
<li>dimension reduction of scRNA-seq data</li>
<li>normalization of scRNA-seq data</li>
<li>clustering of cell populations</li>
<li>differential gene expression analysis</li>
<li>time-series analysis of gene expression dynamics</li>
</ul>
<div id="quick-start" class="section level2">
<h2>Quick start</h2>
<p><code>scImpute</code> can be easily incorporated into an existing scRNA-seq analysis pipeline. Its only input is the raw count matrix, with rows representing genes and columns representing cells. It outputs an imputed count matrix with the same dimensions. In the simplest case, the imputation task can be done with a single function, <code>scimpute</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">scimpute</span>(<span class="co"># full path to raw count matrix</span>
         <span class="dt">count_path =</span> <span class="kw">system.file</span>(<span class="st">&quot;extdata&quot;</span>, <span class="st">&quot;raw_count.csv&quot;</span>, <span class="dt">package =</span> <span class="st">&quot;scImpute&quot;</span>), 
         <span class="dt">infile =</span> <span class="st">&quot;csv&quot;</span>,           <span class="co"># format of input file</span>
         <span class="dt">outfile =</span> <span class="st">&quot;csv&quot;</span>,          <span class="co"># format of output file</span>
         <span class="dt">out_dir =</span> <span class="st">&quot;./&quot;</span>,           <span class="co"># full path to output directory</span>
         <span class="dt">labeled =</span> <span class="ot">FALSE</span>,          <span class="co"># cell type labels not available</span>
         <span class="dt">drop_thre =</span> <span class="fl">0.5</span>,          <span class="co"># threshold set on dropout probability</span>
         <span class="dt">Kcluster =</span> <span class="dv">2</span>,             <span class="co"># 2 cell subpopulations</span>
         <span class="dt">ncores =</span> <span class="dv">10</span>)              <span class="co"># number of cores used in parallel computation</span></code></pre></div>
<p>This function returns the column indices of outlier cells and creates a new file <code>scimpute_count.csv</code> in <code>out_dir</code> to store the imputed count matrix.</p>
</div>
<div id="step-by-step-description" class="section level2">
<h2>Step-by-step description</h2>
<p>The input file can be a <code>.csv</code> file, <code>.txt</code> file, or <code>.rds</code> file. In all cases, the <strong>first column</strong> should give the gene names and the <strong>first row</strong> should give the cell names. We use the example files in the package as illustration. If the raw counts are stored in a <code>.csv</code> file, and we also hope to output the imputed matrix into a <code>.csv</code> file, then specify this information with</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># full path of the input file</span>
count_path =<span class="st"> </span><span class="kw">system.file</span>(<span class="st">&quot;extdata&quot;</span>, <span class="st">&quot;raw_count.csv&quot;</span>, <span class="dt">package =</span> <span class="st">&quot;scImpute&quot;</span>)
infile =<span class="st"> &quot;csv&quot;</span>
outfile =<span class="st"> &quot;csv&quot;</span></code></pre></div>
<p>Similarly, if the raw counts are stored in a <code>.txt</code> file, and we also hope to output the imputed matrix into a <code>.txt</code> file, then specify this information with</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># full path of the input file</span>
count_path =<span class="st"> </span><span class="kw">system.file</span>(<span class="st">&quot;extdata&quot;</span>, <span class="st">&quot;raw_count.txt&quot;</span>, <span class="dt">package =</span> <span class="st">&quot;scImpute&quot;</span>)
infile =<span class="st"> &quot;txt&quot;</span>
outfile =<span class="st"> &quot;txt&quot;</span></code></pre></div>
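<p>To make the expected layout concrete, the following minimal sketch builds a toy gene-by-cell count matrix and writes it to a <code>.csv</code> file with gene names in the first column and cell names in the first row. The gene names, cell names, counts, and the file name <code>toy_raw_count.csv</code> are all made up for illustration, and reading the file back with <code>read.csv</code> is only an assumption about how such a table is typically loaded, not a description of <code>scImpute</code> internals.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Toy raw count matrix: 4 genes (rows) by 3 cells (columns); all names are made up
counts = matrix(c(0, 3, 12,
                  5, 0,  0,
                  8, 1,  4,
                  0, 0,  7),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

# Write to a temporary .csv file: gene names go in the first column,
# cell names in the first row
toy_path = file.path(tempdir(), "toy_raw_count.csv")
write.csv(counts, toy_path)

# Read it back the way a gene-by-cell table is typically loaded
check = as.matrix(read.csv(toy_path, header = TRUE, row.names = 1))
dim(check)</code></pre></div>
<p>A path such as <code>toy_path</code> could then be passed as <code>count_path</code> together with <code>infile = &quot;csv&quot;</code>.</p>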
<p>Next, we need to set up the directory to store all the temporary and final outputs:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># a '/' sign is necessary at the end of the path</span>
out_dir =<span class="st"> &quot;~/output/&quot;</span></code></pre></div>
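<p>Because the trailing <code>/</code> is easy to forget, a small helper can add it when it is missing. The sketch below is only an illustration: <code>ensure_trailing_slash</code> is a hypothetical helper that is not part of <code>scImpute</code>, and the directory is created under <code>tempdir()</code> here in case it does not already exist.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper (not part of scImpute): append "/" if it is missing
ensure_trailing_slash = function(path) {
  if (endsWith(path, "/")) path else paste0(path, "/")
}

# Use a temporary location here so the sketch does not touch the home directory
out_dir = ensure_trailing_slash(file.path(tempdir(), "scimpute_output"))

# Create the directory (and any missing parents) if it does not exist yet
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
dir.exists(out_dir)</code></pre></div>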
<p>We highly recommend using parallel computing with <code>scImpute</code>, which significantly reduces the computation time. For example, to use 20 cores, run the <code>scimpute</code> function with <code>ncores = 20</code>.</p>
<p><code>scImpute</code> has two statistical parameters. The <strong>first parameter is <code>Kcluster</code></strong>, which determines the <strong>number of initial clusters</strong> to help identify candidate neighbors of each cell. The imputation results do not rely heavily on the choice of <code>Kcluster</code>, since <code>scImpute</code> uses a model-based method to select similar cells in a later stage. <code>Kcluster</code> can be specified based on the number of known cell types and users’ biological expertise, and it may also be learned by clustering the raw data and inspecting the clustering results. The <strong>second parameter</strong> is <code>drop_thre</code>. Only values whose <strong>dropout probability</strong> is larger than <code>drop_thre</code> are imputed by <code>scImpute</code>. The default threshold <code>drop_thre = 0.5</code> is sufficient for most scRNA-seq data.</p>
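<p>When it is unclear how many cores a machine offers, the requested number can be capped at what the system reports. This is a minimal sketch using <code>detectCores()</code> from the <code>parallel</code> package that ships with R; the variable names <code>requested</code> and <code>available</code> are ours, and on a shared server the appropriate cap may well be lower than the reported count.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(parallel)  # ships with base R

requested = 20                 # number of cores we would like to use
available = detectCores()      # may be NA on some platforms
if (is.na(available)) available = 1

# Never ask for more cores than the machine reports, and use at least one
ncores = max(1, min(requested, available))
ncores</code></pre></div>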
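<p>One possible way to explore candidate values of <code>Kcluster</code> by clustering the raw data is sketched below on simulated counts. It is an exploratory heuristic of our own (log transformation, PCA, then k-means), not the procedure that <code>scImpute</code> uses internally; the simulated data and all variable names are assumptions made for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)
n_genes = 200

# Two simulated groups of 15 cells; group 2 has higher expression in the first 50 genes
group1 = matrix(rpois(n_genes * 15, lambda = 5), nrow = n_genes)
group2 = matrix(rpois(n_genes * 15, lambda = 5), nrow = n_genes)
group2[1:50, ] = matrix(rpois(50 * 15, lambda = 30), nrow = 50)
counts = cbind(group1, group2)

# Log-transform, then project the cells (columns) onto the first two principal components
log_counts = log10(counts + 1)
scores = prcomp(t(log_counts))$x[, 1:2]

# Total within-cluster sum of squares for K = 1, ..., 5
wss = sapply(1:5, function(k) kmeans(scores, centers = k, nstart = 10)$tot.withinss)
names(wss) = paste0("K", 1:5)
round(wss, 1)</code></pre></div>
<p>A value of K after which the within-cluster sum of squares stops dropping sharply is a reasonable starting point for <code>Kcluster</code>, to be checked against biological knowledge.</p>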
<p>Now to get the imputed matrix, all we need is the main <code>scimpute</code> function</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">Kcluster =<span class="st"> </span><span class="dv">2</span>
drop_thre =<span class="st"> </span><span class="fl">0.5</span>
ncores =<span class="st"> </span><span class="dv">10</span>
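<span class="kw">scimpute</span>(count_path, <span class="dt">infile =</span> infile, <span class="dt">outfile =</span> outfile, <span class="dt">out_dir =</span> out_dir, <span class="dt">labeled =</span> <span class="ot">FALSE</span>, <span class="dt">drop_thre =</span> drop_thre, <span class="dt">Kcluster =</span> Kcluster, <span class="dt">ncores =</span> ncores)</code></pre></div>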
<p>If <code>outfile = &quot;csv&quot;</code>, this function will create a new file <code>scimpute_count.csv</code> in <code>out_dir</code> to store the imputed count matrix; if <code>outfile = &quot;txt&quot;</code>, this function will create a new file <code>scimpute_count.txt</code> in <code>out_dir</code>.</p>
<p>Note that R matches unnamed arguments by position, so the order of parameters matters; we suggest naming each argument, as in <strong>Quick start</strong>, to avoid mistakes. If users would like to apply <code>scImpute</code> to data coming from homogeneous cells, this can be achieved by setting <code>Kcluster = 1</code> and <code>labeled = FALSE</code>.</p>
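<p>The effect of argument order can be seen with a toy function. In the sketch below, <code>toy_fun</code> and its argument order are hypothetical and invented purely for illustration; it is not the <code>scimpute</code> signature. It shows how a value passed by position can silently land in the wrong argument, while named arguments are matched correctly in any order.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical toy function; its argument order is invented for illustration
toy_fun = function(count_path, infile = "csv", type = "count", out_dir = "./") {
  paste0("type=", type, "; out_dir=", out_dir)
}

# Positional call: the third value lands in `type`, not in `out_dir`
positional = toy_fun("raw_count.csv", "csv", "~/output/")

# Named call: each value reaches the intended argument regardless of order
named = toy_fun("raw_count.csv", out_dir = "~/output/", infile = "csv")

positional
named</code></pre></div>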
</div>
<div id="apply-scimpute-with-cell-type-information" class="section level2">
<h2>Apply <code>scImpute</code> with cell type information</h2>
<p>Sometimes users may have the cell type (or subpopulation) information of the single cells, and <code>scimpute</code> can take advantage of this information to impute within each cell type. To do this, we need a character vector <code>labels</code> specifying the cell type of each column in the raw count matrix. In other words, the length of <code>labels</code> equals the number of cells, and the order of elements in <code>labels</code> should match the order of columns in the raw count matrix. Then we just need to specify <code>labeled = TRUE</code> in <code>scimpute</code> (default is <code>FALSE</code>) and supply the <code>labels</code> argument. <code>Kcluster</code> is not used when <code>labeled = TRUE</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">labels =<span class="st"> </span><span class="kw">readRDS</span>(<span class="kw">system.file</span>(<span class="st">&quot;extdata&quot;</span>, <span class="st">&quot;labels.rds&quot;</span>, <span class="dt">package =</span> <span class="st">&quot;scImpute&quot;</span>))
labels[<span class="dv">1</span><span class="op">:</span><span class="dv">5</span>]
<span class="op">&gt;</span><span class="st"> </span>[<span class="dv">1</span>] <span class="st">&quot;c1&quot;</span> <span class="st">&quot;c1&quot;</span> <span class="st">&quot;c1&quot;</span> <span class="st">&quot;c2&quot;</span> <span class="st">&quot;c2&quot;</span>

<span class="kw">scimpute</span>(count_path, 
         <span class="dt">infile =</span> <span class="st">&quot;csv&quot;</span>, 
         <span class="dt">outfile =</span> <span class="st">&quot;csv&quot;</span>, 
         <span class="dt">out_dir =</span> out_dir,
         <span class="dt">labeled =</span> <span class="ot">TRUE</span>, 
         <span class="dt">drop_thre =</span> <span class="fl">0.5</span>,
         <span class="dt">labels =</span> labels, 
         <span class="dt">ncores =</span> <span class="dv">10</span>)</code></pre></div>
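<p>Before running the imputation with labels, it can help to confirm that <code>labels</code> lines up with the columns of the count matrix. The following sketch uses a toy matrix with made-up gene and cell names and the five labels printed above; the checks are our own suggestion rather than something <code>scImpute</code> requires users to run.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Toy data: 5 cells with made-up names and the labels printed above
counts = matrix(rpois(20, lambda = 3), nrow = 4,
                dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))
labels = c("c1", "c1", "c1", "c2", "c2")

# One label per column, and no missing labels
stopifnot(is.character(labels),
          length(labels) == ncol(counts),
          !anyNA(labels))

# Number of cells in each cell type
table(labels)</code></pre></div>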
</div>
<div id="apply-scimpute-to-tpm-values" class="section level2">
<h2>Apply <code>scImpute</code> to TPM values</h2>
<p>We strongly suggest using <code>scImpute</code> on count matrices. However, if only TPM values are available, users can apply <code>scImpute</code> with gene lengths supplied. <code>scImpute</code> will use the gene lengths (sum of exon lengths) to scale the data, which helps ensure a good fit of the mixture models. In this case, users need to specify <code>type = &quot;TPM&quot;</code> (<code>type = &quot;count&quot;</code> by default) and supply a vector <code>genelen</code> of gene lengths. The order of genes in <code>genelen</code> should match the order in the expression matrix. For example:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="op">&gt;</span><span class="st"> </span>genelen[<span class="dv">1</span><span class="op">:</span><span class="dv">3</span>]
ENSMUSG00000021252 ENSMUSG00000007777 ENSMUSG00000024442
              <span class="dv">4235</span>                <span class="dv">998</span>               <span class="dv">2404</span>

<span class="kw">scimpute</span>(count_path, 
         <span class="dt">infile =</span> <span class="st">&quot;csv&quot;</span>, 
         <span class="dt">outfile =</span> <span class="st">&quot;csv&quot;</span>, 
         <span class="dt">out_dir =</span> out_dir,
         <span class="dt">type =</span> <span class="st">&quot;TPM&quot;</span>,
         <span class="dt">genelen =</span> genelen,
         <span class="dt">drop_thre =</span> <span class="fl">0.5</span>,
         <span class="dt">ncores =</span> <span class="dv">10</span>)</code></pre></div>
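<p>Since the order of <code>genelen</code> must match the rows of the expression matrix, one way to guarantee this is to reorder a named vector by the matrix row names. The sketch below assumes that <code>genelen</code> is a named vector; the TPM values and cell names are made up, and the gene IDs and lengths are the ones printed above.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Toy TPM matrix; gene IDs are the ones printed above, values are made up
tpm = matrix(c(10.5, 0.0, 3.2,
                0.0, 7.1, 0.0,
                2.4, 2.0, 9.9),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("ENSMUSG00000007777",
                               "ENSMUSG00000021252",
                               "ENSMUSG00000024442"),
                             paste0("cell", 1:3)))

# Named vector of gene lengths, deliberately in a different order
genelen = c(ENSMUSG00000021252 = 4235,
            ENSMUSG00000007777 = 998,
            ENSMUSG00000024442 = 2404)

# Reorder by gene name so that entry i corresponds to row i of the matrix
genelen = genelen[rownames(tpm)]
stopifnot(!anyNA(genelen), identical(names(genelen), rownames(tpm)))
genelen</code></pre></div>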
</div>
<div id="how-to-save-computation-time-with-scimpute" class="section level2">
<h2>How to save computation time with <code>scImpute</code></h2>
<p><code>scImpute</code> benefits from parallel computation, and the memory cost per processor is modest. <code>scimpute</code> completes computation in seconds when applied to a dataset with 10,000 genes and 100 cells, running with 10 cores. The memory requirement for this dataset is around 2 GB. The running time mostly depends on</p>
<ul>
<li>number of processors (<code>ncores</code>)</li>
<li>number of cells in the scRNA-seq data</li>
</ul>
<p>When the number of cells is extremely large, filtering the cells beforehand can reduce the computation time of <code>scImpute</code>.</p>
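<p>As one illustration of such a filtering step, the sketch below removes cells in which very few genes are detected. The criterion and the threshold <code>min_genes = 20</code> are assumptions chosen for this toy example, not recommendations from <code>scImpute</code>; suitable filters depend on the protocol and the data.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)

# Toy count matrix: 100 genes by 8 cells; the last two cells are nearly empty
counts = matrix(rpois(100 * 8, lambda = 4), nrow = 100,
                dimnames = list(paste0("gene", 1:100), paste0("cell", 1:8)))
counts[, 7:8] = 0
counts[1:3, 7:8] = 1

# Number of genes with a nonzero count in each cell
genes_detected = colSums(counts != 0)

# Keep cells with at least `min_genes` detected genes (threshold is illustrative)
min_genes = 20
keep = genes_detected >= min_genes
filtered = counts[, keep, drop = FALSE]
dim(filtered)</code></pre></div>
<p>The filtered matrix can then be written to a file and passed to <code>scimpute</code> through <code>count_path</code>.</p>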
</div>



<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

</body>
</html>
